General instructions for all assignments:
Upload your R Markdown file (named [AndrewID]-315-HW09.Rmd, e.g. “mneykov-315-HW09.Rmd”) to the Homework 09 submission section on Canvas. You do not need to upload the .html file. You should complete HW09 as a group; only one submission is needed per group. Groups that submit multiple assignments may lose points at the instructors’ discretion.
library(tidyverse)
achidamb_315_theme <- theme_bw() + # White background, black and white theme
theme(axis.text = element_text(size = 10, color = "navy",family = "serif"),
text = element_text(size = 14, face = "bold", color = "navy"))
COLOR:
achidamb_color_palette <- c("#2D3184","#0082A6", "#4EBBB9", "#9CDFC2", "#D8F0CD","#F3F1E4")
(2 points each)
Parallel Coordinates and Radar Charts
There are no standard ggplot() geometries for creating parallel coordinates plots or radar charts, but there is an implementation in the GGally package.
Create a parallel coordinates plot of the continuous variables in the Cars93 dataset. Color the lines by the Type of car. Code is partially completed for you below. Be sure to rotate the x-axis labels, update the legend, and add titles/axis labels:

library(MASS)
library(tidyverse)
library(GGally)
data(Cars93)
cont_cols <- which(names(Cars93) %in%
                     c("Price", "MPG.city", "MPG.highway", "EngineSize",
                       "Horsepower", "RPM", "Fuel.tank.capacity", "Passengers",
                       "Length", "Wheelbase", "Width", "Turn.circle", "Weight"))
ggparcoord(Cars93, columns = cont_cols) +
  aes(color = factor(Type)) +
  coord_flip() +
  labs(title = "Value vs. Variables by Type of Car",
       x = "Variables", y = "Value",
       color = "Car Type") +
achidamb_315_theme
Car type 4 gets better mileage (both MPG.highway and MPG.city) than the other car types; its standardized values in these categories are around 4. Car type 6 fits the most passengers, with a value close to 3 in the graph.
Repeat part (a), but create a radar chart instead. To do this, simply add + coord_polar() to your parallel coordinates code. Which plot is easier to read?
ggparcoord(Cars93, columns = cont_cols) +
  aes(color = factor(Type)) +
  labs(title = "Value vs. Variables by Type of Car",
       x = "Variables", y = "Value",
       color = "Car Type") +
  coord_polar() +
achidamb_315_theme
(Hint: see the scale parameter.) What could you change the scale parameter to in order to mimic the way parallel coordinates charts were introduced in class? Do this, and create a new graph showing the result.

The default scale standardizes each variable (units of standard deviations), which lets us compare how much the car types vary on each variable. The parameter value that matches the in-class introduction is "uniminmax" (each variable rescaled to [0, 1]); we also did not flip the coordinates this time.
ggparcoord(Cars93, columns = cont_cols, scale = "uniminmax") +
  aes(color = factor(Type)) +
  labs(title = "Value vs. Variables by Type of Car",
       x = "Variables", y = "Value",
       color = "Car Type") +
achidamb_315_theme
No two types are strongly positively correlated. The two most positively correlated pairs are probably types (1, 3) and (2, 3): the type-3 lines usually sit between those of types 1 and 2, and they tend to follow a similar pattern to a slightly different degree. The two most negatively correlated pairs are types (4, 6) and (2, 4). Type 4 mostly takes the opposite values from types 2 and 6; for example, on the last five variables, type-4 cars sit near the bottom with values around 0, while types 2 and 6 are at the top with values close to 1.
cars_cont <- dplyr::select(Cars93, Price, MPG.city, MPG.highway, EngineSize,
Horsepower, RPM, Fuel.tank.capacity, Passengers,
Length, Wheelbase, Width, Turn.circle, Weight)
library(reshape2)
correlation_matrix <- cor(cars_cont)
melted_cormat <- melt(correlation_matrix)
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
geom_tile()+labs(title = 'Cars Correlation Heat Map', x = '', y = '') +
scale_fill_gradient2(low = "darkred", high = "darkblue",
                     mid = "lightgrey",
                     midpoint = 0, limits = c(-1, 1), space = "Lab",
name="Correlation") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
The plot above is a correlation heat map: a graphical representation of the pairwise correlations between all of the selected continuous variables, using a color gradient to encode the strength and sign of each correlation. Because it covers the full matrix, it shows every possible pair of the selected variables from the dataset.
This reminds me a lot of mosaic plots.
# Taken from guide
reorder_cormat <- function(cormat) {
  # Use correlation between variables as distance
  dd <- as.dist((1 - cormat) / 2)
  hc <- hclust(dd)
  cormat[hc$order, hc$order]
}
get_upper_tri <- function(cormat) {
  cormat[lower.tri(cormat)] <- NA
  return(cormat)
}
correlation_matrix <- cor(cars_cont)
correlation_matrix <- reorder_cormat(correlation_matrix)
correlation_matrix <- get_upper_tri(correlation_matrix)
melted_cormat <- melt(correlation_matrix, na.rm=TRUE)
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) +
geom_tile() + labs(title = 'Cars Correlation Heat Map', x = '', y = '') +
scale_fill_gradient2(low = "darkred", high = "darkblue",
                     mid = "lightgrey",
                     midpoint = 0, limits = c(-1, 1), space = "Lab",
name="Correlation") +
geom_text(aes(Var1, Var2, label = sprintf("%0.2f", value)),
          color = "green", size = 6) +
theme(text = element_text(size=15),
axis.text.x = element_text(angle = 90, hjust = 1))
(20 points)
Variable Dendrograms
Another way to visually explore potential associations between continuous variables in our dataset is with dendrograms.
Create a variable dendrogram for the continuous variables in the Cars93 dataset. To do this: compute the correlation matrix and convert it to dissimilarities with cormat <- 1 - abs(cormat); turn this into a distance object with the as.dist() function; hierarchically cluster (hclust()), convert the result to a dendrogram (as.dendrogram()), then plot with ggplot().

(5 points) Examine the four-cluster solution. Which variables are in the same cluster? Does it make sense that these are in the same cluster, given both your common-sense understanding of these variables and given the correlation plot you created in Problem 2?
(1 point) What other measures (other than correlation) could you use to measure similarity / dissimilarity between continuous variables for the purposes of a variable dendrogram? (There is not necessarily a right or wrong answer here – just brainstorm ideas.)
If you find your graphic runs over the boundaries, try the standard approaches of adjusting ylim and xlim. Also, the knitted file sometimes looks different from the RStudio preview, so knit once before you finalize.
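The dendrogram pipeline described above can be sketched as follows. This is only a sketch: the ggdendro package and its ggdendrogram() helper are one assumed route for plotting a dendrogram with ggplot(), and cars_cont is the data frame of continuous variables from Problem 2.

```r
library(ggdendro)  # assumed helper package for ggplot()-based dendrograms

cormat <- cor(cars_cont)
cormat <- 1 - abs(cormat)          # strong correlations become small distances
var_hc <- hclust(as.dist(cormat))  # hierarchical clustering of the variables
var_dend <- as.dendrogram(var_hc)
ggdendrogram(var_dend) +
  labs(title = "Cars93 Variable Dendrogram")
```

For part (a), cutree(var_hc, k = 4) lists the four-cluster memberships.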
(2 points each)
Love Actually Character Network
Read this article from FiveThirtyEight. Write 2-3 sentences summarizing any methods of analysis that they used.
Load the Love Actually adjacency matrix from FiveThirtyEight’s GitHub Page. Store this in an object called love_adjacency. Convert this into a distance matrix, using \(1/(1+x)\) as a conversion function between the adjacencies and the distances. Use hierarchical clustering with average linkage (method = "average" in hclust()) and convert the result to a dendrogram. Visualize this with ggplot(), add appropriate titles/labels/themes/etc. (Code is partially provided to do this.)
library(dendextend)
love_adjacency <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/love-actually/love_actually_adjacencies.csv")
love_mat <- as.matrix(love_adjacency[, -1])
love_dist <- 1 / (1 + as.dist(love_mat))          # adjacency -> distance via 1/(1+x)
love_hc <- hclust(love_dist, method = "average")  # average linkage
love_dend <- as.dendrogram(love_hc)
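One way to visualize the resulting dendrogram with ggplot() is a sketch like the following, assuming love_dend is the dendrogram built from the distances above; as.ggdend() from the already-loaded dendextend package is one assumed route.

```r
# sketch: dendextend's as.ggdend() turns a dendrogram into ggplot()-able data
ggplot(as.ggdend(love_dend)) +
  labs(title = "Love Actually Character Dendrogram") +
  theme_minimal()
```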
Interpret the resulting dendrogram. Which characters are connected in the movie?
Read about the ggraph package here. What does it do? When was it released? What ggplot()-like functionality does ggraph provide?
Read this post on adjusting the edges in ggraph. How would you create a dendrogram with the ggraph package? Use the example code at this link to create a dendrogram with this dataset using ggraph (NOT the same way that you created it in part (b)).
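One plausible sketch of the ggraph route, assuming love_dend is the dendrogram built for part (b); the "dendrogram" layout and geom_edge_elbow() are our choices here, not necessarily those in the linked post:

```r
library(ggraph)

# sketch: ggraph can lay out a dendrogram object directly
ggraph(love_dend, layout = "dendrogram") +
  geom_edge_elbow() +
  geom_node_text(aes(label = label), angle = 90, hjust = 1, size = 2) +
  theme_void()
```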
Create a basic network diagram of the Love Actually data using the ggraph package. Code is partially started for you below.
library(igraph)
library(ggraph)
names <- love_adjacency[, 1]
# graph_from_adjacency_matrix() expects a matrix, not a dist object
graph <- graph_from_adjacency_matrix(as.matrix(love_adjacency[, -1]),
                                     mode = "undirected", weighted = TRUE)
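A minimal way to finish the network plot might look like this. This is a sketch: the layout and aesthetics are our choices, and we assume the first column of love_adjacency holds the character names.

```r
# attach character names to the vertices (assumes column 1 holds the names)
V(graph)$name <- dplyr::pull(love_adjacency, 1)

ggraph(graph, layout = "fr") +   # Fruchterman-Reingold force-directed layout
  geom_edge_link(alpha = 0.3) +
  geom_node_point(size = 2) +
  geom_node_text(aes(label = name), repel = TRUE, size = 3) +
  theme_void() +
  labs(title = "Love Actually Character Network")
```

Note that geom_node_text(repel = TRUE) requires the ggrepel package to be installed.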
(2 points each) Using the documentation at the link in parts (d), (e), and the ggraph GitHub page, make at least three adaptations to your graph from (f). For example, you might size the points, size the edges, use arcs (curved edges), use geom_edge_density, etc.
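For instance, three adaptations could be combined in one sketch (assuming the graph was built with edge weights, e.g. weighted = TRUE; the layout, scales, and ranges are arbitrary choices):

```r
ggraph(graph, layout = "linear", circular = TRUE) +
  geom_edge_arc(aes(edge_width = weight), alpha = 0.2) +  # 1: arcs, 2: edge width
  geom_node_point(aes(size = igraph::degree(graph))) +    # 3: node size by degree
  scale_edge_width(range = c(0.2, 1.5)) +
  theme_void() +
  labs(size = "Degree", edge_width = "Shared scenes")
```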
(BONUS: 3 points) Color the nodes of the graph by the gender of the actor/actress. Facet on the gender of the actor/actress.
(4 points each)
Waffle Charts
There is no standard waffle-chart geometry in ggplot(), but one can be built with geom_tile(). What is the purpose of a waffle chart? What would you use a waffle chart to visualize? (I.e., what type of data? How many dimensions/variables?)

# Set up data to create the waffle chart
library(MASS)
data(Cars93)
var <- Cars93$Type # the categorical variable you want to plot
nrows <- 9 # the number of rows in the resulting waffle chart
categ_table <- floor(table(var) / length(var) * (nrows*nrows))
temp <- rep(names(categ_table), categ_table)
df <- expand.grid(y = 1:nrows, x = 1:nrows) %>%
mutate(category = sort(c(temp, sample(names(categ_table),
nrows^2 - length(temp),
prob = categ_table,
replace = T))))
# Make the Waffle Chart
ggplot(df, aes(x = x, y = y, fill = category)) +
geom_tile(color = "black", size = 0.5) +
scale_x_continuous(breaks = NULL) +
scale_y_continuous(breaks = NULL) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Waffle Chart of Car Type",
caption = "Source: Cars93 Dataset",
fill = "Car Type",
x = NULL, y = NULL) +
theme_bw() # Use your theme
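Since part (b) asks for the same chart at two grid sizes, the setup above can be wrapped in a small helper; this is a sketch, and waffle_df is a name we introduce.

```r
# build the x/y/category grid for a waffle chart of a categorical vector `var`
waffle_df <- function(var, nrows) {
  # proportion of each category, scaled to an nrows x nrows grid
  categ_table <- floor(table(var) / length(var) * nrows^2)
  temp <- rep(names(categ_table), categ_table)
  expand.grid(y = 1:nrows, x = 1:nrows) %>%
    mutate(category = sort(c(temp, sample(names(categ_table),
                                          nrows^2 - length(temp),
                                          prob = categ_table,
                                          replace = TRUE))))
}

# e.g. waffle_df(imdb$content_rating, 25) and waffle_df(imdb$content_rating, 50)
```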
Create a waffle chart for the content_rating variable in the imdb data from the lab exam. Use 25 rows. Then recreate the same graph, but use 50 rows. Which version of the chart do you prefer?
Critique these graphs. What are the issues with waffle charts?
(1 point each)
Arc Pie Charts
Install and load the ggforce package. This package implements several extensions and improvements to ggplot2.
Use geom_arc_bar() to create a pie chart of the Type variable in the Cars93 dataset. (Code provided.)

library(ggforce)
Cars93 %>% group_by(Type) %>%
summarize(count = n()) %>%
mutate(max = max(count),
focus_var = 0.2 * (count == max(count))) %>%
ggplot() + geom_arc_bar(aes(x0 = 0, y0 = 0, r0 = 0.8, r = 1,
fill = Type, amount = count),
stat = 'pie')
Adjust the r0 parameter to lower and higher values. What does this control? What are its minimum and maximum values?
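To see the effect, compare r0 = 0 (a filled pie) with a value approaching r (a thin ring); a sketch:

```r
# r0 = 0 leaves no inner hole, i.e. an ordinary pie chart;
# values near r = 1 shrink the wedges to a thin ring
Cars93 %>%
  group_by(Type) %>%
  summarize(count = n()) %>%
  ggplot() +
  geom_arc_bar(aes(x0 = 0, y0 = 0, r0 = 0, r = 1,
                   fill = Type, amount = count),
               stat = 'pie')
```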
Recreate the graph from (a), but this time, add explode = focus_var into your call to aes(). What does this do?
Recreate the graph from (c), but this time, add focus to the category with the minimum number of observations.
(4 points) Critique these graphs.
When would you want to use explode to focus on a particular variable?

(5 points each)
Zoom Zoom
See the following code working with the IMDb movies dataset from Homework 7 for how to use facet_zoom().
library(tidyverse)
library(forcats)
library(devtools)
library(ggforce)
# Colorblind-friendly color palette
my_colors <- c("#000000", "#56B4E9", "#E69F00", "#F0E442", "#009E73", "#0072B2",
               "#D55E00", "#CC79A7")
# Read in the data
imdb <- read_csv("https://raw.githubusercontent.com/mateyneykov/315_code_data/master/data/imdb_test.csv")
# get some more variables
imdb <- mutate(imdb, profit = (gross - budget) / 1000000,
is_french = ifelse(country == "France", "Yes", "No")) %>%
filter(movie_title != "The Messenger: The Story of Joan of Arc")
france_1990 <- filter(imdb, country == "France", title_year >= 1990)
# this code plots a scatterplot + a zoomed facet
ggplot(data = imdb, aes(x = title_year, y = profit)) +
geom_point(color = my_colors[1], alpha = 0.25) +
geom_smooth(color = my_colors[2]) +
geom_point(data = france_1990, color = my_colors[3]) +
geom_smooth(data = france_1990, aes(x = title_year, y = profit),
            color = my_colors[4], method = "lm") +
facet_zoom(x = title_year >= 1990) +
labs(title = "Movie Profits over Time",
subtitle = "Zoom: French Movies from 1990 -- 2017 (orange/yellow)",
caption = "Data from IMDB and Kaggle",
x = "Year of Release",
y = "Profit (millions of USD)")
Also read the articles here or here.
Recreate any scatterplot that we created throughout the year, and zoom in on a section of the graph via the facet_zoom() feature in the newest version of the ggforce package. Include a title, subtitle, and caption in the resulting graph. The caption should just state the data source, and the subtitle should explain what area of the plot is being enhanced via zooming.
Interpret the resulting graph: Describe some feature of the new version of the graph that you may not have been able to see very well in the previous version of the same graph (without zooming).